feat(plugins-soniox): surface per-run language segments #1602
rosetta-livekit-bot[bot] wants to merge 4 commits into
🦋 Changeset detected
Latest commit: 836f146
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 34 packages.
Co-authored-by: rosetta-livekit-bot[bot] <282703043+rosetta-livekit-bot[bot]@users.noreply.github.com>
if (data === SpeechStream.FLUSH_SENTINEL) {
  continue;
}
ws.send(data.data.buffer);
🔴 Sending data.data.buffer may transmit incorrect bytes when AudioFrame's typed array is a view into a larger ArrayBuffer
In #sendAudio, ws.send(data.data.buffer) sends the entire underlying ArrayBuffer of the Int16Array. If the AudioFrame.data typed array is a view with a non-zero byteOffset or doesn't span the full buffer (e.g., after resampling or slicing), this sends more/wrong bytes than intended. Other plugins (e.g., Deepgram, ElevenLabs) typically send the typed array directly or use Buffer.from(data.buffer, data.byteOffset, data.byteLength) to handle this correctly.
Affected code in stt.ts
ws.send(data.data.buffer) should be ws.send(Buffer.from(data.data.buffer, data.data.byteOffset, data.data.byteLength)) or simply ws.send(data.data) which the ws library handles correctly for typed arrays.
Suggested change:
- ws.send(data.data.buffer);
+ ws.send(Buffer.from(data.data.buffer, data.data.byteOffset, data.data.byteLength));
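The byte-range problem described in this comment can be reproduced without a WebSocket. The following is a minimal sketch, assuming a frame whose samples are a `subarray` view into a larger backing buffer; the names `backing`, `frameData`, `wholeBuffer`, and `exactBytes` are hypothetical and do not come from the plugin.

```typescript
// Hypothetical frame: 4 samples carved out of the middle of an 8-sample buffer.
const backing = new Int16Array([0, 1, 2, 3, 4, 5, 6, 7]);
const frameData = backing.subarray(2, 6); // a view: byteOffset 4, byteLength 8

// What `.buffer` exposes: the entire backing store, not just the view.
const wholeBuffer = new Uint8Array(frameData.buffer);

// What should go on the wire: only the bytes the view actually covers.
const exactBytes = new Uint8Array(
  frameData.buffer,
  frameData.byteOffset,
  frameData.byteLength,
);

// Reinterpret the exact bytes as samples to confirm they match the view.
const roundTrip = Array.from(
  new Int16Array(exactBytes.buffer, exactBytes.byteOffset, exactBytes.byteLength / 2),
);

console.log(wholeBuffer.byteLength); // 16
console.log(exactBytes.byteLength); // 8
console.log(roundTrip); // [ 2, 3, 4, 5 ]
```

Sending `frameData.buffer` here would transmit 16 bytes, including samples that belong to neighbouring frames, whereas the offset-aware view covers exactly the 8 bytes of the frame.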
});

try {
  await Promise.race([sendTask, listenTask, waitForAbort(this.abortSignal)]);
🔴 Using Promise.race instead of waiting for both send and listen tasks causes premature WebSocket closure and lost final transcripts
In #runWS, await Promise.race([sendTask, listenTask, waitForAbort(this.abortSignal)]) means that when the audio input ends and sendTask resolves, the code immediately enters the finally block and closes the WebSocket — without waiting for the server to send its final transcription and finished message. Unlike Deepgram which uses Promise.all([sendTask(), listenTask.result, ...]) to wait for both sides to complete, this plugin closes the connection before receiving the server's final response. While message handlers are technically still attached during the WebSocket closing handshake, this relies on fragile timing behavior and the server being fast enough to flush before the close completes.
Prompt for agents
In plugins/soniox/src/stt.ts in the #runWS method, the Promise.race on line 262 causes the WebSocket to close as soon as sendTask resolves (audio input ends), without waiting for the server to send back its final transcription and 'finished' message. The fix should restructure this so that after sendTask completes, the code waits for listenTask to resolve (i.e., the server sends finished or error), while still respecting the abort signal. A common pattern (used by the Deepgram plugin) is to use Promise.all for sendTask+listenTask and then race that against the abort signal. Something like: await Promise.race([Promise.all([sendTask, listenTask]), waitForAbort(this.abortSignal)]). This ensures the server has time to flush remaining transcriptions after audio input ends.
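The restructuring suggested above can be illustrated with timers standing in for the real tasks. This is a sketch under stated assumptions: `sleep`, `run`, and the event strings are hypothetical, and `waitForAbort` here is a stand-in written for the example, not the plugin's own helper.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in: resolves once the signal is aborted.
function waitForAbort(signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve();
      return;
    }
    signal.addEventListener('abort', () => resolve(), { once: true });
  });
}

async function run(signal: AbortSignal): Promise<string[]> {
  const events: string[] = [];

  // The audio input ends quickly ...
  const sendTask = (async () => {
    await sleep(10);
    events.push('send done');
  })();

  // ... but the server needs longer to flush its final transcript.
  const listenTask = (async () => {
    await sleep(30);
    events.push('finished received');
  })();

  // Wait for BOTH sides to complete, unless the caller aborts first.
  await Promise.race([Promise.all([sendTask, listenTask]), waitForAbort(signal)]);

  // Stand-in for the `finally` block that closes the WebSocket.
  events.push('socket closed');
  return events;
}

const result = run(new AbortController().signal);
result.then((events) => console.log(events));
```

With the inner `Promise.all`, the "socket closed" step is reached only after the listener has seen the server's final message. Racing `sendTask` and `listenTask` directly would reach it right after "send done".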
Summary
Fixes #5685 (and the follow-up source-side symptom raised in the comment thread, which @chenghao-mou approved bundling into the same PR).
Both halves are the same plugin bug:
_TokenAccumulator._lang_segments is built per-run by the existing coalescing logic but then dropped in send_endpoint_transcript (and the interim path). The fix surfaces it through new SpeechData fields on the target side, and stops dropping it on the source side in non-translation mode.

Changes
- stt.SpeechData: add target_languages / target_texts (symmetric to existing source_languages / source_texts). Same LanguageCode coercion in __post_init__. Default None, so the addition is strictly additive for every other plugin.
- Translation mode: populate target_* from final._lang_segments on FINAL_TRANSCRIPT and INTERIM_TRANSCRIPT / PREFLIGHT_TRANSCRIPT. Consumers now see the per-run target breakdown for code-switched two-way translation, e.g. target_languages=["en", "es"] / target_texts=["Hello, how are you?", " Estoy bien, gracias."] for the translation of "Hello, ¿cómo estás? I'm doing fine, gracias.".
- Non-translation mode: populate source_* from the same accumulator (previously None). A code-switched ja+en utterance now surfaces source_languages=["ja", "en"] / source_texts=["こんにちは、私の名はサムです。", " My name is Sam."] -- matches what the SpeechData docstring already promised for "multi-language detection services".
- _lang_segments_to_fields helper to DRY the conversion across both modes and both event paths; the four duplicated inline list comprehensions collapse to one named operation. The predicate that distinguishes source from target became data-presence-based (final_original._lang_segments) rather than config-based (is_translation_mode is not None), which is what unified both halves cleanly.
- SpeechData.text and SpeechData.language are unchanged for back-compat (still the full concatenation and the first translated/detected language, respectively).

Test plan
- tests/test_plugin_soniox_stt.py covering:
  - SpeechData.__post_init__ target_languages coercion (strings → LanguageCode, None stays None, existing LanguageCode passthrough)
  - _TokenAccumulator._lang_segments per-run coalescing
  - _lang_segments_to_fields helper edge cases (empty → (None, None), non-empty → parallel lists with LanguageCode coercion)
  - source_* from final + non-final merged
  - source_* populated, target_* None
  - source_* carries the per-run breakdown
- "Hello, ¿cómo estás? I'm doing fine, gracias." → text="Hello, how are you? Estoy bien, gracias.", target_languages=["en", "es"], target_texts=["Hello, how are you?", " Estoy bien, gracias."], "".join(target_texts) == text. Source side unchanged.
- " こんにちは、私の名はサムです。 My name is Sam." → text=" こんにちは、私の名はサムです。 My name is Sam.", source_languages=["ja", "en"], source_texts=[" こんにちは、私の名はサムです。", " My name is Sam."], target_* correctly None. Interim events also surface the multi-language source breakdown progressively as the user code-switches.
- ruff format clean, ruff check clean, no new mypy --strict errors introduced in changed files.

Follow-ups (intentionally not in this PR)
- final / final_original accumulator names are honest about routing today but the new target_* fields make their two-mode roles more glaring (final is "primary user-facing accumulator", final_original is "source-side accumulator that's empty in non-translation mode"). Worth a separate behavior-preserving rename PR to final_primary / final_source.
- target_* fields are wired in Soniox only; other translation-capable plugins (Gladia, Deepgram v2, AWS) can adopt them in follow-up PRs.
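
The per-run coalescing and the segments-to-fields conversion described under Changes can be sketched roughly as follows. This is an illustration only, not the plugin's code: the `Token` and `LangSegment` shapes and the functions `coalesceRuns` and `langSegmentsToFields` are hypothetical names chosen for the example, and the token boundaries are made up to match the translated output quoted in the test plan.

```typescript
// Hypothetical shapes, assumed for this sketch only.
interface Token {
  text: string;
  language: string;
}

interface LangSegment {
  language: string;
  text: string;
}

// Merge consecutive tokens that share a language into a single run.
function coalesceRuns(tokens: Token[]): LangSegment[] {
  const runs: LangSegment[] = [];
  for (const tok of tokens) {
    const last = runs[runs.length - 1];
    if (last && last.language === tok.language) {
      last.text += tok.text;
    } else {
      runs.push({ language: tok.language, text: tok.text });
    }
  }
  return runs;
}

// Empty input maps to a pair of nulls (the "(None, None)" case);
// otherwise the runs become two parallel arrays.
function langSegmentsToFields(
  segments: LangSegment[],
): [string[] | null, string[] | null] {
  if (segments.length === 0) {
    return [null, null];
  }
  return [segments.map((s) => s.language), segments.map((s) => s.text)];
}

// Made-up token stream for the translated utterance.
const tokens: Token[] = [
  { text: 'Hello,', language: 'en' },
  { text: ' how are you?', language: 'en' },
  { text: ' Estoy bien,', language: 'es' },
  { text: ' gracias.', language: 'es' },
];

const [targetLanguages, targetTexts] = langSegmentsToFields(coalesceRuns(tokens));
const joined = (targetTexts ?? []).join('');

console.log(targetLanguages); // [ 'en', 'es' ]
console.log(targetTexts); // [ 'Hello, how are you?', ' Estoy bien, gracias.' ]
console.log(joined); // Hello, how are you? Estoy bien, gracias.
```

Because each run keeps its leading whitespace, joining the per-run texts reproduces the full text, which is the invariant the test plan checks.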